Random texts exhibit Zipf's-law-like word frequency distribution

Author

  • Wentian Li
Abstract

It is shown that the distribution of word frequencies for randomly generated texts is very similar to Zipf's law observed in natural languages such as English. The facts that the frequency of occurrence of a word is almost an inverse power law function of its rank, and that the exponent of this inverse power law is very close to 1, are largely due to the transformation from the word's length to its rank, which stretches an exponential function into a power law function.

Zipf observed long ago [1] that the distribution of word frequencies in English, when the words are ordered by rank, is an inverse power law with an exponent very close to 1. In other words, if the most frequently occurring word appears in the text with frequency P(1), the next most frequently occurring word with frequency P(2), and the rank-r word with frequency P(r), then the frequency distribution is

    P(r) = C / r^α    (0.1)

with C ≈ 0.1 and α ≈ 1. This distribution, also called Zipf's law, has been checked against the standard corpus of present-day English with very good agreement [2]. The fall-off of the distribution as the rank increases is obvious, since the more frequently occurring words necessarily have larger frequencies than the less frequently occurring ones. Nevertheless, it remains a puzzle why the decay is a power law rather than an exponential or some other faster-decaying function, and why the exponent is very close to 1 rather than 2 or an even larger value. There have been attempts to incorporate Zipf's law into the grander framework of "fractals" [3], but in doing so little insight has been gained into this particular "law." Probably few people pay attention to a comment by Miller in his preface to Zipf's book [4]: that randomly generated texts, which are perhaps the least interesting sequences and unrelated to any other scaling behaviors, also exhibit Zipf's law. What he was saying is that Zipf's law is not exclusive to English or any other natural language. Miller did not give ...
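To make the mechanism concrete, the following Python sketch generates a "monkey-at-the-typewriter" text, in which each keystroke is a space with some fixed probability and otherwise a uniformly chosen letter, and prints the top of the resulting rank-frequency table. This is only an illustration of the idea under assumed parameters (a 26-letter alphabet, a space probability of 0.2, and an arbitrary text length), not the paper's own code.

    import random
    from collections import Counter

    # Illustrative sketch, not the paper's code: type N_CHARS random keystrokes,
    # where each keystroke is a space with probability SPACE_PROB and otherwise
    # a uniformly chosen letter; words are the space-delimited chunks.
    random.seed(0)
    ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # assumed 26-letter alphabet
    SPACE_PROB = 0.2                          # assumed probability of a space
    N_CHARS = 1_000_000                       # assumed text length

    keystrokes = [
        " " if random.random() < SPACE_PROB else random.choice(ALPHABET)
        for _ in range(N_CHARS)
    ]
    words = "".join(keystrokes).split()

    counts = Counter(words)
    total = sum(counts.values())

    # For a Zipf-like distribution P(r) = C / r^alpha, successive frequencies in
    # this table drop roughly as 1/r (up to ties among words of equal length).
    for rank, (word, count) in enumerate(counts.most_common(20), start=1):
        print(f"rank {rank:2d}   word {word!r:>6}   frequency {count / total:.6f}")

Because all words of the same length are equally likely in this model, the rank-frequency curve is really a staircase of plateaus, but on a log-log plot it follows a straight line of slope close to -1 over many decades of rank, which is the effect described in the abstract.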




Similar articles

Random Texts Do Not Exhibit the Real Zipf's Law-Like Rank Distribution

BACKGROUND Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the 2nd most frequent word has rank 2, ...) is approximately linear when plotted on a double logarithmic scale. It has been argued that the law is not a relevant or useful property of language because simple random texts - constructed by concatenating random...


Zipf's Law and Random Texts

Random-text models have been proposed as an explanation for the power law relationship between word frequency and rank, the so-called Zipf’s law. They are generally regarded as null hypotheses rather than models in the strict sense. In this context, recent theories of language emergence and evolution assume this law as a priori information with no need of explanation. Here, random texts and rea...


Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts

Zipf's law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf's law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with dif...


Zipf's Law and Avoidance of Excessive Synonymy

Zipf's law states that if words of language are ranked in the order of decreasing frequency in texts, the frequency of a word is inversely proportional to its rank. It is very reliably observed in the data, but to date it escaped satisfactory theoretical explanation. This article suggests that Zipf's law may result from a hierarchical organization of word meanings over the semantic space, which...


Language Learning, Power Laws, and Sexual Selection

A diagnostic of a power law distribution is that a log-log plot of frequency against rank yields a (nearly) straight line. For instance, Zipf (1935) plotted word token counts in a variety of texts against the inverse rank of each distinct word type and showed that typically such plots approximate a straight line. The characteristic ‘Zipf curve’ of word frequency against rank deviates from this ...
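A minimal sketch of this diagnostic in Python is given below. The word counts are invented purely for illustration (they are not from any corpus); the point is the procedure: fit a straight line to log frequency versus log rank and read the exponent off the slope.

    import math

    # Hypothetical word counts, invented for illustration only; replace with
    # token counts from a real text to perform the actual diagnostic.
    counts = [1200, 640, 410, 300, 245, 200, 175, 150, 135, 120]

    ranks = range(1, len(counts) + 1)
    xs = [math.log(r) for r in ranks]   # log rank
    ys = [math.log(c) for c in counts]  # log frequency (raw counts are fine:
                                        # normalization only shifts the intercept)

    # Ordinary least-squares slope of log frequency on log rank; a slope near -1
    # is the signature of the classical Zipf curve.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

    print(f"estimated Zipf exponent: {-slope:.2f}")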





Journal:
  • IEEE Trans. Information Theory

Volume 38, Issue

Pages -

Publication date: 1992